## Installation

```bash
npx skills add https://github.com/daffy0208/ai-dev-standards --skill 'RAG Implementer'
```
# RAG Implementer

Build production-ready retrieval-augmented generation systems.

## Core Principle

**RAG = Retrieval + Context Assembly + Generation**

Use RAG when you need LLMs to access fresh, domain-specific, or proprietary knowledge that wasn't in their training data.
## ⚠️ Prerequisites & Cost Reality Check

### STOP: Have You Validated the Need for RAG?

Before implementing RAG, confirm:

- [ ] **Problem validated** - Completed `product-strategist` Phase 1 (problem discovery)
- [ ] **Users need AI search** - Tested with simpler alternatives (see below)
- [ ] **ROI justified** - Calculated the cost vs. benefit of RAG compared to alternatives
### Try These FIRST (Before RAG)

RAG is powerful but expensive. Try cheaper alternatives first:

**1. FAQ Page / Documentation (1 day, $0)**

- Create a well-organized FAQ or docs page
- Add search with Cmd+F
- **Works for:** <50 common questions, static content
- **Test:** Do users find answers? If yes, stop here.

**2. Simple Keyword Search (2-3 days, $0-20/month)**

- Use Algolia, Typesense, or PostgreSQL full-text search
- Good enough for 80% of use cases
- **Works for:** <100k documents, keyword matching sufficient
- **Test:** Do users get relevant results? If yes, stop here.

**3. Manual Curation (Concierge MVP) (1 week, $0)**

- Manually answer user questions
- Build an FAQ from the common questions
- **Works for:** <100 users, validating whether users want AI at all
- **Test:** Do users value your answers enough to pay? If yes, consider RAG.

**4. Simple Semantic Search (1 week, $30-50/month)**

- Use OpenAI embeddings + Postgres pgvector (see the sketch below)
- Skip complex retrieval, re-ranking, etc.
- **Works for:** <50k documents, basic semantic search
- **Test:** Are embeddings better than keyword search? If no, stop here.
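If option 4 is where you land, here is a minimal sketch of what "OpenAI embeddings + Postgres pgvector" can look like. It assumes a hypothetical `docs` table with a `vector(1536)` embedding column, the `text-embedding-3-small` model, and a local connection string; all of those are placeholders to adapt.

```python
# Minimal semantic search: embed the query, rank rows by cosine distance in pgvector.
from openai import OpenAI
import psycopg

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(text: str) -> str:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    # pgvector accepts vectors as a '[v1,v2,...]' literal
    return "[" + ",".join(str(v) for v in resp.data[0].embedding) + "]"

def search(query: str, top_k: int = 5) -> list[tuple[str, float]]:
    vec = embed(query)
    with psycopg.connect("postgresql://localhost/ragdb") as conn:  # placeholder DSN
        return conn.execute(
            """
            SELECT content, 1 - (embedding <=> %s::vector) AS similarity
            FROM docs
            ORDER BY embedding <=> %s::vector
            LIMIT %s
            """,
            (vec, vec, top_k),
        ).fetchall()
```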
### Cost Reality Check

**Naive RAG (Prototype)**

- **Time:** 1-2 weeks
- **Cost:** $50-150/month (vector DB + embeddings + API calls)
- **When:** Prototype, <10k documents, proof of concept

**Advanced RAG (Production)**

- **Time:** 3-4 weeks
- **Cost:** $200-500/month (hybrid search, re-ranking, monitoring)
- **When:** Production, 10k-1M documents, validated demand

**Modular RAG (Enterprise)**

- **Time:** 6-8 weeks
- **Cost:** $500-2000+/month (multiple KBs, specialized modules)
- **When:** Enterprise, 1M+ documents, mission-critical
### Decision Tree: Do You Really Need RAG?

```
Do users need to search your content?
│
├─ No → Don't build RAG ❌
│
└─ Yes
   ├─ <50 items? → FAQ page ✅ ($0)
   │
   └─ >50 items?
      ├─ Keyword search enough? → Use Algolia ✅ ($0-20/mo)
      │
      └─ Need semantic understanding?
         ├─ <50k docs? → Simple semantic (pgvector) ✅ ($30/mo)
         │
         └─ >50k docs?
            ├─ Validated with users? → Build RAG ✅
            └─ Not validated? → Test with Concierge MVP first ⚠️
```
### Validation Checklist

Only proceed with RAG implementation if:

- [ ] Tested simpler alternatives (FAQ, keyword search, manual curation)
- [ ] Users confirmed they need AI-powered search (not just because you think they do)
- [ ] Calculated ROI: cost of RAG < value users get
- [ ] Have >50k documents OR complex semantic search requirements
- [ ] Budget: $200-500/month for infrastructure
- [ ] Time: 3-4 weeks for production implementation

**If any checkbox is unchecked:** Go back to the `product-strategist` or `mvp-builder` skills to validate first.

**See also:** `PLAYBOOKS/validation-first-development.md` for the step-by-step validation process.
## 8-Phase RAG Implementation

### Phase 1: Knowledge Base Design

**Goal:** Create a well-structured knowledge foundation

**Actions:**

- Map data sources (internal: docs, databases, APIs / external: web, feeds)
- Filter noise, select authoritative content (prevent the "data dump fallacy")
- Define chunking strategy: semantic chunking based on document structure
- Add metadata: tags, timestamps, source identifiers, categories

**Validation:**

- All data sources catalogued and prioritized
- Data quality assessed (accuracy, completeness, freshness)
- Chunking strategy tested with sample documents
- Metadata schema validated for search effectiveness

**Common Chunking Strategies** (a sketch follows this list):

- Fixed-size: 500-1000 tokens, 50-100 token overlap
- Semantic: by paragraph, section headers, or topic boundaries
- Recursive: split by structure (markdown headers, code blocks)
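To make the fixed-size strategy concrete, here is a minimal token-based chunker with overlap. It assumes `tiktoken` for tokenization; the 800/80 values are just starting points inside the ranges listed above.

```python
# Fixed-size chunking with overlap, counted in tokens rather than characters.
import tiktoken

def chunk_text(text: str, chunk_tokens: int = 800, overlap_tokens: int = 80) -> list[str]:
    enc = tiktoken.get_encoding("cl100k_base")  # common OpenAI tokenizer
    tokens = enc.encode(text)
    chunks: list[str] = []
    step = chunk_tokens - overlap_tokens
    for start in range(0, len(tokens), step):
        window = tokens[start : start + chunk_tokens]
        chunks.append(enc.decode(window))
        if start + chunk_tokens >= len(tokens):
            break
    return chunks
```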
### Phase 2: Embedding Strategy

**Goal:** Choose the optimal embedding approach for semantic understanding

**Actions:**

- Select an embedding model: `text-embedding-3-large` (3072 dim) for general use, a domain-specific model for specialized content
- Plan multi-modal needs (text, code, images, tables)
- Decide on fine-tuning: use domain data if general embeddings underperform
- Establish similarity benchmarks

**Validation:**

- Embedding model benchmarked on domain data (a comparison sketch follows the model list below)
- Retrieval accuracy tested with known query-document pairs
- Storage and compute costs validated

**Model Selection:**

- General: OpenAI `text-embedding-3-large`, `text-embedding-3-small`
- Code: `code-search-babbage-code-001` or StarEncoder
- Multilingual: `multilingual-e5-large`
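As a hedged illustration of the benchmarking step, the sketch below scores known query-document pairs by cosine similarity so two candidate embedding models can be compared on the same data; the example pair and model name are placeholders.

```python
# Compare embedding models on known query-document pairs via cosine similarity.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed_batch(texts: list[str], model: str) -> np.ndarray:
    resp = client.embeddings.create(model=model, input=texts)
    return np.array([d.embedding for d in resp.data])

def mean_pair_similarity(pairs: list[tuple[str, str]], model: str) -> float:
    queries = embed_batch([q for q, _ in pairs], model)
    docs = embed_batch([d for _, d in pairs], model)
    # Cosine similarity of each query against its known-relevant document.
    sims = (queries * docs).sum(axis=1) / (
        np.linalg.norm(queries, axis=1) * np.linalg.norm(docs, axis=1)
    )
    return float(sims.mean())

# Hypothetical usage: higher mean similarity on your own pairs is a rough signal;
# also confirm that irrelevant documents score lower (see Phase 6 metrics).
pairs = [("how do I reset my password?", "To reset your password, open Settings...")]
print(mean_pair_similarity(pairs, "text-embedding-3-small"))
```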
### Phase 3: Vector Store Architecture

**Goal:** Implement a scalable vector database

**Actions:**

- Choose a vector DB (Pinecone, Weaviate, Qdrant, Chroma, pgvector)
- Configure the index: HNSW for speed, IVF for scale (see the sketch below)
- Plan scalability: data growth and query volume
- Implement backup, recovery, and security

**Validation:**

- Vector store benchmarked under expected load
- Index optimized for retrieval speed and accuracy
- Backup and recovery tested
- Security controls implemented

**Vector DB Decision:**

- Managed cloud → Pinecone
- Self-hosted, feature-rich → Weaviate
- Lightweight, local → Chroma
- Cost-conscious → pgvector (Postgres extension)
- High-performance → Qdrant
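If pgvector is the choice, index configuration is one DDL statement. The sketch below (assuming pgvector 0.5+ and the same hypothetical `docs` table as earlier) creates an HNSW index, with the IVFFlat alternative shown as a comment; the parameter values are common starting points, not benchmarked recommendations.

```python
# Create an ANN index on the embedding column: HNSW for recall/speed, IVFFlat for scale.
import psycopg

with psycopg.connect("postgresql://localhost/ragdb") as conn:  # placeholder DSN
    conn.execute(
        """
        CREATE INDEX IF NOT EXISTS docs_embedding_hnsw
        ON docs USING hnsw (embedding vector_cosine_ops)
        WITH (m = 16, ef_construction = 64);
        """
    )
    # IVFFlat alternative: build it after the table is populated, since the list
    # centroids are computed from existing rows.
    # CREATE INDEX docs_embedding_ivf
    #   ON docs USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
```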
### Phase 4: Retrieval Pipeline

**Goal:** Build sophisticated retrieval beyond simple similarity search

**Actions:**

- Implement hybrid retrieval: semantic search + keyword (BM25); a fusion sketch appears at the end of this phase
- Add query enhancement: expansion, reformulation, multi-query
- Apply contextual filtering: metadata, temporal constraints, relevance ranking
- Design for query types: factual (precision), analytical (breadth), creative (diversity)
- Handle edge cases: no relevant results found

**Advanced Techniques:**

- **Re-ranking:** Use a cross-encoder after initial retrieval (e.g., `cross-encoder/ms-marco-MiniLM-L-12-v2`); see the sketch after this list
- **Query routing:** Route different query types to specialized strategies
- **Ensemble methods:** Combine multiple retrieval approaches
- **Adaptive retrieval:** Adjust top-k based on query complexity
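A minimal re-ranking sketch using the cross-encoder named above through the `sentence-transformers` library; the candidates are assumed to come from whatever first-stage retriever you use.

```python
# Re-rank first-stage candidates with a cross-encoder (slower but more accurate than
# bi-encoder similarity, so apply it only to the top ~20-100 candidates).
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")

def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]
```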
**Validation:**

- Retrieval accuracy tested across diverse query types
- Hybrid retrieval outperforms single-method baselines
- Query latency meets requirements (<500ms ideal)
- Edge cases and fallbacks tested
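To illustrate the hybrid-retrieval action, here is a sketch that fuses a BM25 keyword ranking (via the `rank_bm25` package) with a semantic ranking using reciprocal rank fusion, which sidesteps normalizing two incompatible score scales. The semantic ranking is assumed to come from the pgvector query sketched earlier, or any retriever that returns documents in ranked order.

```python
# Hybrid retrieval: fuse BM25 and semantic rankings with Reciprocal Rank Fusion (RRF).
from rank_bm25 import BM25Okapi

def rrf_fuse(rankings: list[list[str]], k: int = 60, top_k: int = 5) -> list[str]:
    # Each document earns 1 / (k + rank) from every ranking it appears in.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

def hybrid_search(query: str, corpus: list[str], semantic_ranking: list[str]) -> list[str]:
    bm25 = BM25Okapi([doc.split() for doc in corpus])  # naive whitespace tokenization
    bm25_scores = bm25.get_scores(query.split())
    keyword_ranking = [doc for doc, _ in
                       sorted(zip(corpus, bm25_scores), key=lambda p: p[1], reverse=True)]
    return rrf_fuse([keyword_ranking, semantic_ranking])
```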
### Phase 5: Context Assembly

**Goal:** Transform retrieved chunks into optimal LLM context

**Actions:**

- Rank and select: prioritize by relevance score, recency, source authority
- Synthesize: merge related chunks, avoid redundancy
- Compress: use LLMLingua or similar for token optimization
- Mitigate "lost in the middle": place critical info at the start and end (a reordering sketch appears at the end of this phase)
- Adapt dynamically: adjust context based on conversation history

**Context Engineering Integration:**

- Blend RAG results with system instructions and user prompts
- Maintain conversation coherence across multi-turn interactions
- Implement context persistence for follow-up queries
- Balance context size vs. information density

**Validation:**

- Context relevance validated against human judgments
- Token optimization maintains accuracy
- Multi-turn conversations maintain coherence
- Assembly latency <200ms
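One way to act on the "lost in the middle" point is to interleave chunks so the strongest evidence lands at the beginning and end of the assembled context. A minimal sketch, assuming the chunks arrive already sorted by descending relevance:

```python
# Reorder relevance-sorted chunks so the best evidence sits at the start and end of
# the context, the positions LLMs attend to most reliably.
def reorder_for_context(chunks_by_relevance: list[str]) -> list[str]:
    front: list[str] = []
    back: list[str] = []
    for i, chunk in enumerate(chunks_by_relevance):
        # Alternate: best chunk first, second-best last, third near the front, ...
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

# Example: ["A", "B", "C", "D", "E"] (A most relevant) -> ["A", "C", "E", "D", "B"]
```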
### Phase 6: Evaluation & Metrics

**Goal:** Measure RAG system performance comprehensively

**Retrieval Quality** (a computation sketch follows this list):

- **Precision@K:** Fraction of the top-K results that are relevant
- **Recall@K:** Fraction of all relevant documents that appear in the top-K
- **MRR (Mean Reciprocal Rank):** Average of 1/rank of the first relevant result
- **NDCG:** Ranking quality with graded relevance
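These retrieval metrics are simple enough to compute without an evaluation framework. A sketch over hypothetical (retrieved IDs, relevant IDs) pairs:

```python
# Retrieval-quality metrics over retrieved/relevant document-ID sets.
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    return sum(1 for doc in retrieved[:k] if doc in relevant) / max(len(relevant), 1)

def mrr(queries: list[tuple[list[str], set[str]]]) -> float:
    # Mean of 1/rank of the first relevant result per query (0 if none retrieved).
    total = 0.0
    for retrieved, relevant in queries:
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(queries)
```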
**Generation Quality:**

- **Faithfulness:** Accuracy of generated content against the sources
- **Answer Relevance:** Relevance of the response to the query
- **Context Utilization:** How effectively the LLM uses retrieved information
- **Hallucination Rate:** Frequency of unsupported claims

**System Performance:**

- **End-to-End Latency:** Query to answer (<3 seconds target)
- **Retrieval Latency:** Time to retrieve and rank (<500ms)
- **Token Efficiency:** Information density per token
- **Cost Per Query:** Combined retrieval + generation costs

**Validation:**

- Baseline metrics established
- A/B testing framework for config comparisons
- Automated evaluation pipeline deployed
- Human evaluation protocols for ground truth
### Phase 7: Production Deployment

**Goal:** Deploy with enterprise-grade reliability and security

**Deployment:**

- Containerize with Docker/Kubernetes
- Implement load balancing across RAG instances
- Add caching for frequent queries (see the sketch below)
- Graceful degradation: fall back to the base model on component failure
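For the caching item, a small in-process TTL cache keyed on the normalized query often covers frequent-query traffic before a shared cache like Redis is needed. Everything in the sketch (the TTL, the normalization, the `answer_query` callable it wraps) is an assumption for illustration.

```python
# Minimal TTL cache for frequent queries, keyed on a normalized query string.
import hashlib
import time

_CACHE: dict[str, tuple[float, dict]] = {}
TTL_SECONDS = 15 * 60  # assumed freshness window

def cached_answer(query: str, answer_query) -> dict:
    key = hashlib.sha256(query.strip().lower().encode()).hexdigest()
    hit = _CACHE.get(key)
    if hit and time.time() - hit[0] < TTL_SECONDS:
        return hit[1]  # cache hit: skip retrieval and generation entirely
    result = answer_query(query)  # full retrieve -> assemble -> generate pipeline
    _CACHE[key] = (time.time(), result)
    return result
```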
**Security:**

- Role-based access controls for the knowledge base
- Data masking and PII protection
- Audit logging for compliance
- Prompt injection defense

**Monitoring:**

- Real-time metrics dashboard (latency, cost, accuracy)
- Query analysis for patterns and failure modes
- Cost tracking and optimization alerts
- Performance profiling for bottlenecks

**Validation:**

- Production handles expected traffic
- Security prevents unauthorized access
- Monitoring provides actionable insights
- Incident response procedures tested
### Phase 8: Continuous Improvement

**Goal:** Establish processes for ongoing enhancement

**Data Pipeline:**

- Automated knowledge base updates (real-time or scheduled)
- Quality monitoring: detect data drift and degradation
- Source diversification: add new data sources
- Feedback integration: user corrections and preferences

**Model Evolution:**

- Evaluate and migrate to improved embeddings
- Fine-tune on domain data regularly
- Upgrade architecture: Naive → Advanced → Modular RAG
- Expand multi-modal support (images, audio, video)

**Optimization:**

- Analyze query patterns, optimize for common needs
- Improve cache hit rates
- Tune vector indices regularly
- Balance performance vs. costs

**Validation:**

- Automated improvement pipelines functioning
- Performance trends show improvement
- User satisfaction increasing
- System adapts to changing needs
## Key RAG Principles

### 1. Relevance Over Volume

- Quality curation > massive datasets
- Remove outdated/low-quality content continuously
- Prioritize the most relevant information to prevent "lost in the middle"

### 2. Semantic Understanding

- Use embeddings for true semantic matching, not just keywords
- Recognize query intent (factual, analytical, creative)
- Adapt retrieval strategy based on context

### 3. Multi-Modal Intelligence

- Handle text, images, code, tables, structured data
- Enable cross-modal retrieval (text query → image results)
- Preserve document structure and formatting

### 4. Temporal Awareness

- Prioritize recent information for time-sensitive topics
- Maintain historical access when relevant
- Integrate real-time data feeds for dynamic domains

### 5. Transparency & Trust

- Always provide source citations
- Indicate confidence levels
- Explain why specific information was selected
## Standard RAG Response Format

```json
{
  "answer": "Generated response incorporating retrieved information",
  "sources": [
    {
      "content": "Retrieved text chunk",
      "source": "Document/URL identifier",
      "relevance_score": 0.95,
      "chunk_id": "unique_identifier"
    }
  ],
  "confidence": 0.87,
  "retrieval_metadata": {
    "chunks_retrieved": 5,
    "retrieval_time_ms": 150,
    "generation_time_ms": 800
  }
}
```
## Critical Success Rules

**Non-Negotiable:**

- ✅ Source attribution for every response
- ✅ Validate generated content against sources (prevent hallucination)
- ✅ Filter sensitive data before retrieval
- ✅ Respond within latency thresholds (<3 seconds)
- ✅ Monitor and optimize costs continuously
- ✅ Comply with security policies
- ✅ Graceful degradation on failures
- ✅ Comprehensive testing before production

**Quality Gates:**

- Before Production: >85% accuracy on the evaluation dataset
- Ongoing: User satisfaction >4.0/5.0
- Performance: 95th percentile latency <5 seconds
- Reliability: 99.5% uptime
- Cost: Within 10% of budget
## Advanced Patterns

### Modular RAG Architecture

- **Search Module:** Query understanding and reformulation
- **Memory Module:** Long-term conversation persistence
- **Routing Module:** Query routing to specialized knowledge bases (see the sketch below)
- **Predict Module:** Anticipatory pre-loading based on context
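A routing module can start as a single LLM call that picks a knowledge base by name. The sketch below assumes the OpenAI chat API, a small chat model, and three hypothetical knowledge-base names.

```python
# Minimal routing module: have an LLM classify which knowledge base should serve the query.
from openai import OpenAI

client = OpenAI()
KNOWLEDGE_BASES = ["product_docs", "engineering_wiki", "support_tickets"]  # hypothetical

def route_query(query: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed small, cheap chat model
        messages=[
            {"role": "system",
             "content": "Reply with exactly one of: " + ", ".join(KNOWLEDGE_BASES)},
            {"role": "user", "content": query},
        ],
        temperature=0,
    )
    choice = resp.choices[0].message.content.strip()
    return choice if choice in KNOWLEDGE_BASES else KNOWLEDGE_BASES[0]  # safe fallback
```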
### Hybrid RAG + Fine-tuning

- RAG for dynamic, frequently changing knowledge
- Fine-tuning for domain-specific reasoning patterns
- Combine the strengths of both for maximum effectiveness
## Related Resources